Abstract:
The present invention relates to a system for the extrapolation and statistical processing of data retrieved from one or more data sources, which comprises a station (1) of a user equipped with means for generating a request for querying data sources, and a server (2) comprising means for receiving the request and querying one or more data sources (3) on the basis of the content of the request. The system organizes the retrieved data into groups and assigns statistical values and indices to the data and to the groups; it is furthermore adapted to generate visual information comprising the plurality of retrieved data. The invention also relates to a corresponding method.
Publication number: CH712818B1
Application number: CH01544/17
Filing date: 2016-06-17
Publication date: 2020-12-30
Inventor: Gatti Giulio
Applicant: Gatti Giulio;
Main IPC class:
Description:

[0001] The present invention relates to the field of information extraction and analysis, in particular to a system for the extrapolation and statistical processing of data obtained from one or more data sources. Within the scope of the present invention, the retrieved data can be, for example, structured data, unstructured data, quantitative data, qualitative data, textual data, opinion data or web data, and they can also carry temporal and/or spatial enrichment.
[0002] Systems and studies are known that allow the systematic collection, storage and analysis of data from multiple sources with the aim of helping operators make certain decisions, for example regarding product distribution, advertising effectiveness, the evaluation of the risks associated with an investment, or the appreciation by users of a specific product or service. Although access to information sources is generally not a problem (consider, for example, the large amount of opinions and comments expressed by consumers on the Internet), it is rather complicated to extract from the enormous amount of available data the information that is actually useful, that is, to select the data that are useful for a specific sector. Such searches, which usually presuppose the choice of a statistical sample of the population to be analyzed and the use of certain search terms, often give unsatisfactory results. Searches that are too general can return a lot of data that must then be screened by operators at a great cost in time, while searches that are too specific can be counterproductive or lead to misleading results. In fact, the choice of search keywords could be wrong: it could omit relevant words because they were not taken into consideration or were even unknown to the operator, or it could attribute to certain words a lower importance than to others.
[0003] The main aim of the present invention is to overcome the limitations of the prior art highlighted above by proposing a new system capable of retrieving information from a plurality of data and information sources and of processing it in such a way as to allow fast and effective analysis.
[0004] Within the scope of this aim, an object of the present invention is to select, from among the publicly accessible information, the information most relevant to a given topic.
[0005] A further object of the invention is to allow the correlation of information retrieved from various media channels.
[0006] Not least, an object of the present invention is to provide a structure that is simple, relatively easy to implement in practice, safe in use and effective in operation, and that has a relatively low cost.
[0007] This aim and these and other objects, which will become clearer in the following, are achieved by a system for the extrapolation and statistical processing of data obtained from one or more data sources comprising:
a user station equipped with means for generating a request for querying data sources, said request comprising one or more search keys;
a server comprising means for receiving said request and querying one or more of said sources on the basis of the content of said request;
said server further comprising:
means for storing data retrieved from one or more of said sources in response to said query;
means for attributing a ranking value to each datum of said retrieved data;
means for grouping said retrieved data into groups by means of a clustering engine comprising a clustering algorithm;
said system being characterized in that said server further comprises:
means for calculating a group statistical index for each of said groups, said group statistical index being calculated, for each of said groups, as the ratio between the sum of said ranking values of the respective data included in the group and the number of said data included in the group;
means for ordering and organizing said retrieved data in a data structure and for showing visual information associated with said retrieved data which clearly shows those considered priority; said data being ordered on the basis of said group statistical indices calculated for said groups or on the basis of said ranking values or both.
[0008] The intended aim and objects are also achieved by a method for the extrapolation and statistical processing of data obtained from one or more sources, comprising the steps of:
generating a data source query request comprising one or more search keys, by means for generating said request comprised in a station;
querying one or more of said data sources on the basis of the content of said request, by means for querying said sources comprised in a server;
storing data retrieved from said one or more sources in response to said query, by means for storing said retrieved data comprised in said server;
attributing a ranking value to each datum of said retrieved data, by means for attributing said ranking value comprised in said server;
grouping said retrieved data into groups by means of a clustering engine comprising a clustering algorithm, by means for grouping said retrieved data comprised in said server;
calculating for each of said groups a group statistical index, said group statistical index being calculated, for each of said groups, as the ratio between the sum of said ranking values of the respective data included in the group and the number of said data included in the group, by means for calculating said group statistical index comprised in said server;
ordering and organizing said retrieved data in a data structure and showing visual information associated with said retrieved data which clearly shows those considered priority, said data being ordered on the basis of said group statistical indices calculated for said groups or on the basis of said ranking values or both, by means for ordering, organizing and showing said retrieved data comprised in said server.
[0009] Advantageously, the system according to the invention makes it possible to process and archive large volumes of data, preferably of the textual type, retrieved from various media channels, for example newspapers, social networks and blogs.
[0010] Conveniently, the system according to the invention provides data useful for conducting statistical studies on the way in which information propagates through information channels such as online newspapers, social networks, blogs, media and various data sources on the Internet.
[0011] Advantageously, the system according to the invention not only allows a web data search to be carried out, but can also reprocess the information in real time together with the historicized information, possibly enriching it with information coming from other data sources, such as, for example, data coming from electromechanical tools that send data in real time to the system, which then historicizes them.
[0012] Advantageously, the system according to the invention makes it possible to study the way in which a piece of information spreads geographically.
[0013] Usefully, the system according to the invention makes it possible to study the financial markets by historicizing the main equity indices and securities and creating suitable statistical indices.
[0014] Further characteristics and advantages of the invention will become clearer from the following detailed description, given by way of illustrative and non-limiting example, accompanied by the relative figures, in which:
Figure 1 is a block diagram illustrating an embodiment of the system for the extrapolation and statistical processing of data obtained from one or more data sources according to the present invention;
Figure 2 is a block diagram illustrating a detail of the embodiment of the system illustrated in Figure 1;
Figure 3 is a flow chart illustrating the operation of an embodiment of the system according to the present invention;
Figure 4 shows a first result, in particular an ordered tree, of an embodiment of the system according to the present invention;
Figure 5 shows a second result, in particular a word cloud, of an embodiment of the system according to the present invention;
Figure 6 is a block diagram illustrating the Query Creator of an embodiment of the system according to the present invention;
Figure 7 is a block diagram illustrating the process of creating the lists of topics, or Probes_Lists, of an embodiment of the system according to the present invention;
Figure 8 is a block diagram illustrating the module comprising dictionaries, or Dictionary_Multi-Idioms, of an embodiment of the system according to the present invention;
Figure 9 is a block diagram illustrating a query in a system of known type;
Figure 10 is a block diagram illustrating a query in an embodiment of the system according to the present invention.
[0015] An exemplary architecture of the system according to the present invention is summarized in the block diagram of Figure 1.
[0016] The system comprises a user station 1, a server 2 and a telecommunication network 4, such as the Internet, which connects the station 1 to the server 2 and the server 2 to a plurality of data sources 3.
[0017] The station 1 of a user comprises a terminal, or "terminal computer", which processes information generated, for example, on input or request of the user or operator and is capable of executing software instructions, in particular a software application for the entry of a request to be forwarded to the server 2. The station 1 is able to communicate, preferably through said software application, with the server 2. In the preferred embodiment, the station 1 comprises a computer with various hardware and software resources that make it possible, among other things, to provide data to the user, to receive input from the user, to reprocess this input and to send it to the server 2. In the preferred embodiment, the user of station 1 formulates the request in the form of textual data. For example, a user of station 1 could search for information relating to a specific topic, such as a news event, a fashion event or a matter of a financial nature. Preferably, the user of station 1 formulates the request by identifying a list of relevant words or search keywords and sends the request to server 2.
[0018] The server 2 comprises known hardware architectures and is preferably located remotely with respect to the station 1, with which it is able to communicate via the telecommunication network 4. The server 2 receives data from the station 1, in particular requests to query sources 3 relating to a specific area or theme.
[0019] The server 2 comprises at least the means 20, 21, 22, 23, 24. These means are preferably implemented as modules of a software application executable by the server 2.
[0020] The server 2 further comprises archiving means (block "Bigdata Architecture" 56 in Figure 6) suitable for storing all the information and data relating to the system for the extrapolation and statistical processing of data retrieved from one or more data sources. In a preferred embodiment of the system 10 according to the invention, the archiving means comprise a database stored on suitably sized memory supports.
[0021] The means 20 are suitable for storing data retrieved from one or more sources 3 on the basis of the content of the request sent by station 1. In particular, the server 2 receives a request formulated by a user of station 1 and generates a query addressed to one or more of the sources 3 to obtain a plurality of information from the network. The choice of sources 3 to be queried can be indicated in the request but can also be determined on the basis of various criteria. For example, the means 20 can carry out this search using the main search engines, such as, by way of non-limiting example, Google and Yahoo, and can query the main online newspapers (local, regional, national, European and global), blogs, social networks, financial websites, public or private databases and the like. The means 20 also collect the data retrieved by this search and store them in the storage media of the server 2 or in storage media that are in any case accessible from the server 2.
[0022] In a preferred embodiment of the invention, the server 2 further comprises a Query Creator able to generate, following the request formulated by the user, the optimized query addressed to one or more of the sources 3 to retrieve the data.
[0023] The Query Creator uses customized probes, represented in the block diagram of Figure 6 by the "Probes" 50 block. data reprocessing, and in a totally automatic way without any human intervention.
[0024] In order to carry out their activities, and therefore the data census, the Probes 50 require lists of topics, also called "Probes_Lists" 52 and 54. The Probes_Lists 52, 54 define the behavior of the Probes 50, that is: what they should look for, how to search for the data, and where to search for the data. The Probes_Lists 52, 54 therefore define the activities that the probes must perform.
[0025] In the context of the query generated by the server 2, the content of the statistical search can be defined as: a. a sentence (understood as a set of words ordered according to criteria of semantic logic); b. a single alphanumeric word; or c. a set of bytes that make up a whole image or part of the image.
[0026] The data search and acquisition process performed by the server 2 comprises: a. public data on the Internet / Social Networks / Blogs; b. data present in shared and / or remote folders and / or on remote file systems; c. data coming from information flows originating from remote architectures.
[0027] It should be noted that the Probes 50 used in the system according to the invention search for data not only on the Internet, as other systems do by using a third-party web search engine (for example Google or Yahoo), but also in shared and/or remote folders, remote servers and/or internal databases, such as, for example, hospital medical records, company databases and shared public folders, or from electromechanical probes, in order to find as much data as possible relating to the single search topic defined by the query.
[0028] The block diagram of Figure 7 shows in detail the process by which the Probes_Lists 52, 54 are created. The data are registered in the system according to the invention on the basis of the source 3 of origin and the group to which they belong. Studying a phenomenon through the source of origin and the group to which the data belong makes it possible to deepen the understanding of social phenomena. Furthermore, it is possible to attribute to each type of data the right position within one or more categories; for example, a datum or piece of information can belong to News, Finance and Fashion at the same time.
[0029] For example, if the system according to the invention lacks information for the study of a particular phenomenon, it is able to generate a query internally to search for the data and to pass it on to the probes, which then take care of searching for the data in the different data sources 3 available. In practice, if the system lacks information, it automatically creates queries without any human intervention.
[0030] The means 21, for attributing a value to each datum of said retrieved data, associate said retrieved and archived data with a certain value defined as the "ranking" of the data. In particular, these means 21 detect the frequency, for example the statistical frequency, with which a given datum is present in the retrieved data. For example, if the data are two newspaper articles relating to a specific news event, and are therefore of the textual type, the means 21 can identify the words that occur most frequently in both texts. Conveniently, the words to be identified need not be indicated in the request entered by the user of station 1: in this way it is possible to identify relevant words that have not been expressly indicated by the operator or that are not attributable to the search keys contained in the operator's request. In other words, while the search keys entered by the operator make it possible to find relevant information, the means 21 make it possible to identify, in the information found, additional data which may be important for the operator. The means 21 then attribute to the words that occur with greater frequency a value (for example 4 or 5, assuming that the attributable value lies in a range of integers from 1 to 5) which is higher than the value attributed to words that occur less frequently or that are not present at all (to which the value 1 or 2 could be associated, for example). Preferably, the means 21 attribute to all the data a value that lies in a range of admissible values. Preferably, this range is a range of positive integers. In one embodiment, all the data are initially assigned a ranking value equal to a predetermined or default value, for example 1, and this value is subsequently modified by the means 21 on the basis of the calculated frequency. Preferably, the default value is the minimum of the admissible values. In one embodiment, the means 21 initially attribute to each datum a default value which is the minimum among those allowed and which is preferably further modified by the means 21 on the basis of the calculated frequency and, in a way better illustrated below, by the means 23.
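The frequency-based attribution of ranking values described above can be sketched as follows. This is only a minimal illustration, not the mapping used by the means 21: the description does not specify how a frequency becomes a value, so the linear scaling, the function name `rank_by_frequency` and the sample token counts are all assumptions.

```python
from collections import Counter


def rank_by_frequency(tokens, min_rank=1, max_rank=5):
    """Map each distinct token to an integer ranking value in [min_rank, max_rank].

    Assumption: ranks grow linearly with the number of occurrences; the
    description only requires frequent words to receive higher values.
    """
    counts = Counter(tokens)
    if not counts:
        return {}
    top = max(counts.values())
    ranks = {}
    for token, n in counts.items():
        # A token seen once keeps the default (minimum) value; the most
        # frequent token receives the maximum admissible value.
        scaled = min_rank + (max_rank - min_rank) * (n - 1) / max(top - 1, 1)
        ranks[token] = round(scaled)
    return ranks


# Hypothetical token counts, chosen only for illustration.
tokens = ["attack"] * 9 + ["museum"] * 7 + ["world"] * 3 + ["bardo"]
ranks = rank_by_frequency(tokens)
```

With these invented counts the most frequent word receives 5 and a word seen only once keeps the default value 1, which is the starting point for the later adjustment by the means 23.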
[0031] In a preferred embodiment of the system according to the invention, the retrieved data are analyzed by means of a Naive Bayes classifier, a classifier based on Bayes' theorem and on the conditional probabilities relating to the problem. In the example model below, the independence of the features is assumed, i.e. it is assumed that the presence or absence of a feature in a data set is not correlated with the presence or absence of other features. The model is summarized as follows:
1. collect all the words and tokens that occur in the text;
2. create the Vocabulary = distinct words + tokens;
3. estimate P(vj) and P(Wk | vj), where:
a. P(Wk | vj) = probability of having the word k given the target value j;
b. P(vj) = probability of the target value j.
In pseudocode, given a Text T and a Vocabulary V, the model is the following:
FOR EACH target value vj DO:
    Docj = subset of T where target value = vj
    P(vj) = |Docj| / |T|
    Tj = document created by concatenating Docj
    n = total words and tokens in Tj, including duplicates
    FOR EACH word or token Wk IN V DO:
        nk = frequency of Wk in Tj
        P(Wk | vj) = (nk + 1) / (n + |V|)
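The pseudocode above can be turned into a short runnable sketch. The version below follows the same estimates, with P(vj) as the share of documents per target value and P(Wk | vj) with add-one smoothing. The function names, the `classify` helper, the use of log-probabilities and the three sample documents are assumptions added for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs):
    """docs: list of (tokens, target) pairs. Returns (priors, likelihoods, vocabulary)."""
    vocabulary = {tok for tokens, _ in docs for tok in tokens}
    by_target = defaultdict(list)
    for tokens, target in docs:
        by_target[target].append(tokens)

    priors, likelihoods = {}, {}
    for target, group in by_target.items():
        priors[target] = len(group) / len(docs)            # P(vj) = |Docj| / |T|
        counts = Counter(tok for tokens in group for tok in tokens)
        n = sum(counts.values())                           # words in Tj, duplicates included
        likelihoods[target] = {
            w: (counts[w] + 1) / (n + len(vocabulary))     # P(Wk | vj) with add-one smoothing
            for w in vocabulary
        }
    return priors, likelihoods, vocabulary


def classify(tokens, priors, likelihoods, vocabulary):
    """Return the target value with the highest log-posterior; unknown words are skipped."""
    def score(target):
        return math.log(priors[target]) + sum(
            math.log(likelihoods[target][t]) for t in tokens if t in vocabulary
        )
    return max(priors, key=score)


# Hypothetical training documents, invented for this example.
docs = [
    (["attack", "museum", "victims"], "news"),
    (["museum", "tunis", "attack"], "news"),
    (["stocks", "index", "market"], "finance"),
]
priors, likelihoods, vocab = train_naive_bayes(docs)
label = classify(["attack", "museum"], priors, likelihoods, vocab)
```

Summing logarithms instead of multiplying probabilities is a common way to avoid numeric underflow on long texts; it does not change which target value wins.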
[0032] The means 22 for grouping the retrieved data make it possible to identify correlations between the retrieved data and to divide such data into groups (or clusters). In particular, for this purpose the means 22 comprise a clustering engine which can receive as input any type of clustering algorithm. In the preferred embodiment, this correlation is obtained by means of a partitioning clustering algorithm of the K-means or K-medoids type. Said means 22 could therefore place in the same group data to which the means 21 have attributed different values. For example, a datum that has been given the default value could share the same group as a datum that has instead been given the maximum allowed value.
[0033] The means 23 for calculating a statistical index make it possible to attribute a certain value, defined as the group statistical index, to each of the groups identified by the means 22. In particular, for this purpose the means 23 comprise a statistical engine which can receive as input any type of statistical algorithm. In the preferred embodiment, this value is given by the ratio between the sum of the values attributed by the means 21 to each of the data included in a group formed by the means 22 and the number of data contained in the same group. In the preferred embodiment, the means 23 are also adapted to modify the value associated with a given datum by the means 21 on the basis of the value associated with the group to which this datum is assigned. For example, to a datum to which the means 21 have associated a ranking value equal to the default value, for example 1, the means 23 instead assign a value equal to or based on the statistical index of the group, for example 4 or another value. Advantageously, and where appropriate, it is thus possible to modulate the value initially attributed by the means 21 by adapting it to the value associated with data having the same affinity, for example data assigned to the same group. In this way the data can have a value, and therefore a relevance, that does not depend exclusively on the frequency with which the data recur in the information (for example textual articles) retrieved from the queried sources 3. In particular, it is also possible to modulate, partially modify or replace the ranking value of the data, in particular of the data to which a default value has been assigned, on the basis of the ranking value of the data of the assigned group or of the statistical index of the group.
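The group statistical index of the preferred embodiment is simply the arithmetic mean of the ranking values in a group. A minimal sketch is given below. The function names, the group identifiers "A" and "B", and the words "fashion" and "show" with their values are hypothetical. The values of the first five words are those quoted in the worked example of the description.

```python
def group_statistical_index(ranking_values):
    """Sum of the ranking values in a group divided by the number of data in it."""
    if not ranking_values:
        raise ValueError("a group must contain at least one datum")
    return sum(ranking_values) / len(ranking_values)


def group_indices(ranks, groups):
    """ranks: datum -> ranking value; groups: datum -> group id (as output by clustering)."""
    members = {}
    for datum, group_id in groups.items():
        members.setdefault(group_id, []).append(ranks[datum])
    return {g: group_statistical_index(values) for g, values in members.items()}


ranks = {"tourists": 4, "victims": 5, "died": 5, "#news": 5, "#world": 2,
         "fashion": 1, "show": 2}
groups = {"tourists": "A", "victims": "A", "died": "A", "#news": "A", "#world": "A",
          "fashion": "B", "show": "B"}  # hypothetical cluster assignment
indices = group_indices(ranks, groups)
```

Here group "A" obtains (4 + 5 + 5 + 5 + 2) / 5 = 4.2 and group "B" obtains 1.5, so the data of group "A" would be treated as higher priority.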
[0034] The means 24 allow the retrieved data to be sorted according to a determined criterion and organized in a data structure, for example a priority list or an ordered tree. The data structure can be sorted according to various criteria, for example according to the group statistical index, the ranking value or both. An example of such an ordered tree is shown in Figure 4. Furthermore, the means 24 are suitable for showing visual information associated with said list, in particular information which clearly shows the data considered to be priority, and therefore more relevant, as compared with those considered less important. An example of such a visualization in the form of a word cloud is shown in Figure 5.
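One possible reading of the ordering performed by the means 24 is sketched below: data are sorted first by the statistical index of their group and then by their own ranking value, and the same information is arranged as a two-level ordered tree. The function names, the alphabetical tie-breaking rule and the sample values are assumptions; the description leaves the exact data structure open.

```python
def order_data(ranks, groups, indices):
    """Priority list: descending group index, then descending ranking value."""
    return sorted(ranks, key=lambda d: (-indices[groups[d]], -ranks[d], d))


def as_ordered_tree(ranks, groups, indices):
    """Two-level tree: groups by descending index, each holding its data by descending rank."""
    tree = []
    for g in sorted(indices, key=lambda g: (-indices[g], g)):
        leaves = sorted((d for d in ranks if groups[d] == g),
                        key=lambda d: (-ranks[d], d))
        tree.append((g, indices[g], leaves))
    return tree


# Hypothetical values for illustration only.
ranks = {"attack": 5, "museum": 4, "world": 2, "fashion": 1, "show": 2}
groups = {"attack": "A", "museum": "A", "world": "A", "fashion": "B", "show": "B"}
indices = {"A": 11 / 3, "B": 1.5}

priority_list = order_data(ranks, groups, indices)
tree = as_ordered_tree(ranks, groups, indices)
```

Sorting on a tuple key makes the "or both" case explicit: the group index decides first and the individual ranking value breaks ties within a group.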
[0035] A word cloud is a visual representation of keywords used within the amount of data found. Generally, the word cloud is presented in alphabetical order, with the peculiar characteristic of attributing a larger font to the most important words: it is therefore a weighted list. The weight of the words, which is rendered with characters of different sizes, is intended for example as the frequency of use within the retrieved data. The larger the character, the higher the keyword frequency.
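The weighting just described, in which a larger character means a higher weight, can be sketched as a linear mapping from weight to font size. The point sizes, the function names and the sample weights are assumptions. No rendering is attempted; the result is only the alphabetically ordered list of words with their sizes.

```python
def font_sizes(weights, min_pt=10, max_pt=48):
    """Linearly map each word's weight to a font size in points."""
    if not weights:
        return {}
    lo, hi = min(weights.values()), max(weights.values())
    span = hi - lo
    sizes = {}
    for word, w in weights.items():
        # When all weights are equal, every word gets the largest size.
        fraction = (w - lo) / span if span else 1.0
        sizes[word] = round(min_pt + fraction * (max_pt - min_pt))
    return sizes


def word_cloud_entries(weights):
    """Alphabetically ordered (word, font size) pairs, i.e. a weighted list."""
    return sorted(font_sizes(weights).items(), key=lambda item: item[0].lower())


# Hypothetical weights (ranking values) for illustration.
weights = {"museum": 5, "attack": 5, "world": 2, "bardo": 4}
entries = word_cloud_entries(weights)
```

Either the raw frequency or the ranking value can be passed in as the weight, since the mapping only depends on relative magnitudes.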
[0036] In a preferred embodiment of the system according to the invention, the server 2 further comprises a module suitable for the Analysis of Human Language Data, therefore specific for the analysis of texts, called "Idioms".
[0037] The Idioms module makes it possible to study the statistical outliers of the registered words. In particular, this module performs: a. the recognition of new words, and therefore the study of neologisms; b. the learning of new ways of communicating, linked for example to the technological and social evolution that gives rise to new words or abbreviations every day.
[0038] In an even more preferred embodiment of the system according to the invention, the Idioms module is connected to a module 40 called "Dictionary_Multi-Idioms", preferably comprising both multi-language translation dictionaries 42 and thematic dictionaries 44. The dictionaries that make up the Dictionary_Multi-Idioms 40 are the element on which the text analysis process is based. Thematic dictionaries 44, also called "Topic Dictionary", are dictionaries in the language or idiom of the text that are specific to the topic dealt with (for example a medical dictionary, a computer dictionary, and so on) and include an indication of the ranking value of each word; they help the system understand the specific research topic and make it possible to focus the analysis only on words that are relevant to the object of the research and/or statistical study.
[0039] The "stop word" process is carried out after the system according to the invention has recognized the language or idiom with which the text being processed has been written, and only after having loaded the correct thematic dictionary 44 into the system.
[0040] In an embodiment of the system according to the invention, the multi-language translation dictionaries 42 and thematic dictionaries 44 can be expanded through self-learning, which occurs simultaneously with the analysis of the texts, in manual mode, by means of the intervention of operators, and through links with foreign-language universities.
[0041] With reference to the flow diagram of Figure 3, the operation of an embodiment of the system according to the invention will now be illustrated. For illustrative and non-limiting purposes, it will be assumed that the research topic is a news event ("news") concerning the Bardo museum in Tunis and that the source 3 is Twitter.
[0042] At step 30, the search keys to be inserted in a request to be sent to server 2 are defined by the user of station 1. For example, these words can be Twitter hashtags of the type: #museum; #Tunis or #Tunisia; #attack; #victims; "World" and other words included in the set of words shown in Figure 4. It is assumed that the word "Bardo" is not included in the search terms. Preferably, the server 2 is able to translate these terms into other languages by means of suitable multi-language translation dictionaries 42, also using self-learning mechanisms and to automatically extend the search to sources 3 in a language different from that of the search keys. Therefore, in the example, server 2 is also able to retrieve the comments (tweets) of users that have been written in a language other than that of the search terms.
[0043] At step 31, the server 2 extracts from the request the search keys, conveniently together with other parameters (for example the time that can be used for the search), and carries out the search. Preferably, the search phase carried out by the server 2 provides for the preliminary creation of a script, for example based on the Knime software.
[0044] At step 32, the server 2 obtains from the queried source 3, after a certain period of time, a plurality of data in response, for example comments published by various users on their Twitter profiles. The data obtained are subjected to a pre-processing phase in order to make the analysis easier, for example by eliminating punctuation, filtering out certain terms, converting the characters into a uniform format (for example from uppercase to lowercase) and, in general, by carrying out "stemming" operations with suitable software tools.
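The pre-processing operations listed above (punctuation removal, term filtering, lowercasing and stemming) can be sketched with the Python standard library alone. The stop-word list, the `crude_stem` suffix stripper and the sample tweet are hypothetical stand-ins. A real deployment would presumably use a proper stemmer and the language-specific dictionaries described earlier.

```python
import string

# Hypothetical stop-word list; the description loads it per language and topic.
STOP_WORDS = {"the", "a", "an", "in", "at", "of", "and", "to", "is"}


def crude_stem(token):
    """Very rough suffix stripping, used here only as a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text, keep="#"):
    """Lowercase, strip punctuation (except characters in `keep`), drop stop words, stem."""
    drop = "".join(ch for ch in string.punctuation if ch not in keep)
    cleaned = text.lower().translate(str.maketrans("", "", drop))
    tokens = [t for t in cleaned.split() if t not in STOP_WORDS]
    return [crude_stem(t) for t in tokens]


tokens = preprocess("Tourists attacked at the Bardo museum in Tunis! #attack #victims")
```

Note that the crude stemmer turns "tunis" into "tuni", which illustrates why words in the resulting visualization may appear in truncated form.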
[0045] At step 33, the server 2 attributes a certain ranking value to each datum obtained. This operation is performed by the means 21, which attribute the ranking value, for example, on the basis of the frequency with which a certain datum occurs in the retrieved data. For example, as exemplified by the word cloud in Figure 5, the word "attack", which is assumed to be present with high frequency, is attributed the value 5; the word "world", which is assumed to be present with lower frequency, is attributed the value 2; while the word "Bardo", which is not present in the search keys but is still present in the retrieved data (for example, with a frequency higher than a predetermined threshold), is assigned the default value 1. Subsequently, the means 22 assign each word to a specific group. For example, the word "Bardo" could be assigned to the same group as the word "attack". It is understood that these values are indicated for purely illustrative purposes. Subsequently, according to an embodiment, the means 21 modify the values of the words to which said means 21 have initially assigned a default value, based, for example, on the value of the words assigned to the same group. For example, the term "Bardo" could be given a value equal to the ratio between the sum of the ranking values of all or some of the words of the group to which it belongs and the number of those words. Considering, for example, the words "tourists", "victims", "died", "#news" and "#world" with their respective ranking values as shown in Figure 4, the ratio is (4 + 5 + 5 + 5 + 2) / 5 = 4.2, which could be approximated to 4. This makes it possible to modulate the value on the basis of the value attributed to the correlated data. Other criteria can be used to modify the ranking values associated with the data, in particular with data initially associated with a default value; for example, such a datum could be attributed the statistical index of the group to which it belongs.
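The adjustment of default-valued words in step 33 can be sketched as follows, using the "Bardo" example. The function name, the single hypothetical group id, the choice to average only over peers that already carry a non-default value, and the half-up rounding are assumptions; the description states only that the ratio "could be approximated".

```python
import math


def rerank_defaults(ranks, groups, default=1):
    """Replace default ranking values with the rounded mean of ranked peers in the same group."""
    updated = dict(ranks)
    for datum, value in ranks.items():
        if value != default:
            continue
        peers = [ranks[d] for d, g in groups.items()
                 if g == groups[datum] and d != datum and ranks[d] != default]
        if peers:  # leave the default untouched when the group has no ranked peer
            updated[datum] = math.floor(sum(peers) / len(peers) + 0.5)
    return updated


# Ranking values quoted in the description; "bardo" starts at the default value 1.
ranks = {"tourists": 4, "victims": 5, "died": 5, "#news": 5, "#world": 2, "bardo": 1}
groups = {word: "attack-cluster" for word in ranks}  # hypothetical group id
new_ranks = rerank_defaults(ranks, groups)
```

In this sketch "bardo" moves from 1 to 4, matching the ratio 21 / 5 = 4.2 approximated to 4, while the other words keep their values.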
[0046] At step 34, the means 24 show a word cloud, i.e. visual information in which the words with a higher value, possibly modified as indicated in step 33, are highlighted with respect to the others. Figure 5 shows an example of such a visualization, where, for example, the word "museum" appears more emphasized than the others (some of the words may appear in truncated form or modified as a result of pre-processing operations).
[0047] It has thus been shown that the described method and system achieve the intended aim and objects. In particular, it has been seen how the system thus conceived makes it possible to overcome the qualitative limits of the prior art, making searches easier, facilitating data processing and reducing the effort needed to select the most pertinent data. In this way, the system advantageously makes it possible to retrieve information relating to various thematic areas and to focus attention on the most important data, attributing an appropriate weight or value to such data. The visual information obtained allows an operator, for example, to focus the writing of a possible article or the creation of reports by giving greater weight to the most relevant words identified. The system thus obtained, also thanks to the choice of sources 3 located in different places, makes it possible to study the social phenomena connected with the spread of a news item, to study the ways in which information propagates and, on the basis of these studies, to carry out analyses of the reactions caused by the dissemination of the news itself.
[0048] Clearly, numerous modifications are evident and can be readily carried out by the person skilled in the art without departing from the scope of protection of the present invention.
[0049] For example, it is obvious to those skilled in the art that the system can also be used to create statistical models for anticipating the fluctuations of the main stock market indices, to carry out risk management studies, to facilitate the writing of scientific articles and to provide tools to aid scientific research.
[0050] Therefore, the scope of the claims is not to be limited by the illustrations or preferred embodiments given in the description by way of example; rather, the claims are to encompass all of the features of patentable novelty that reside in the present invention, including all features that would be treated as equivalent by the person skilled in the art.
[0051] The content of Italian patent application No. 102015000024569 (UB2015A001469), whose priority is claimed in the present application, is incorporated by reference.
[0052] Where the technical features in the claims are followed by reference numerals and/or abbreviations, these have been added for the sole purpose of increasing the intelligibility of the claims; accordingly, they have no effect on the scope of any element identified, by way of example only, by such reference numerals and/or abbreviations.
Claims:
Claims (8)
[1]
1. System for the extrapolation and statistical processing of data obtained from one or more data sources, comprising:
- a station (1) of a user equipped with means for generating a request for querying data sources (3), said request comprising one or more search keys;
- a server (2) comprising means adapted to receive said request and to query one or more of said sources (3) on the basis of the content of said request;
said server (2) further comprising:
- means (20) suitable for storing data retrieved from one or more of said sources (3) in response to said query;
- means (21) for attributing a ranking value to each datum of said retrieved data;
- means (22) for grouping said retrieved data into groups by means of a clustering engine comprising a clustering algorithm;
said system being characterized in that said server (2) further comprises:
- means (23) for calculating a group statistical index for each of said groups, said group statistical index being calculated, for each of said groups, as the ratio between the sum of said ranking values of the respective data included in the group and the number of said data included in the group;
- means (24) for ordering and organizing said retrieved data in a data structure and for showing visual information associated with said retrieved data which clearly shows the data considered to have priority; said data being ordered on the basis of said group statistical indices calculated for said groups or on the basis of said ranking values or both.
[2]
2. System according to claim 1, characterized in that said means (21) for attributing said ranking value are adapted to attribute a ranking value equal to, or based on, a repetition frequency, said repetition frequency being calculated as the ratio between the number of repetitions of said datum among said retrieved data and the total number of said data retrieved in response to said query.
[3]
3. System according to claim 1 or 2, characterized in that said search keys and said retrieved data are of the textual type.
[4]
4. System according to claim 3, characterized in that said server (2) is configured to translate said search keys into other languages, automatically extending said query to sources (3) in a language different from the language of said search keys.
[5]
5. System according to one of claims 1 to 4, characterized in that said means (21) for attributing said ranking value are further adapted to assign a predefined ranking value to each datum found that has no correspondence in said search keys.
[6]
6. System according to one of claims 1 to 5, characterized in that said means (21) for attributing said ranking value are further adapted to assign to each datum found that has no correspondence in said search keys a ranking value equal to, or based on, said statistical index of the group to which said datum belongs.
[7]
7. System according to one of claims 1 to 6, characterized in that said clustering algorithm is of the K-means or K-medoids partition type.
[8]
8. Method for the extrapolation and statistical processing of data obtained from one or more sources, by means of a system according to one of claims 1 to 7, comprising the steps of:
- generating a request for querying data sources (3) comprising one or more search keys, by means for generating said request included in a station (1);
- querying one or more of said sources (3) of data on the basis of the content of said request, by means for querying said sources (3) included in a server (2);
- storing data retrieved from said one or more sources (3) in response to said query, by means (20) for storing said retrieved data included in said server (2);
- attributing a ranking value to each datum of said retrieved data, by means (21) for attributing said ranking value included in said server (2);
- grouping said retrieved data into groups by means of a clustering engine comprising a clustering algorithm, by means (22) for grouping said retrieved data included in said server (2);
- calculating for each of said groups a group statistical index, said group statistical index being calculated, for each of said groups, as the ratio between the sum of said ranking values of the respective data included in the group and the number of said data included in the group, by means (23) for calculating said group statistical index included in said server (2);
- ordering and organizing said retrieved data in a data structure and showing visual information associated with said retrieved data which clearly shows the data considered to have priority, said data being ordered on the basis of said group statistical indices calculated for said groups or on the basis of said ranking values or both, by means (24) for ordering, organizing and showing said retrieved data included in said server (2).
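By way of illustration only, the calculations recited in claims 1, 2 and 8 (repetition frequency, group statistical index and ordering) can be sketched as below. The sketch rests on several assumptions that are not part of the claims: the function names and sample words are hypothetical, the group assignment `group_of` stands in for the output of the clustering engine and is simply given, the "number of data included in the group" is taken to be the number of distinct data, and the ordering uses the group index first and the ranking value second.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def repetition_frequency(items: List[str]) -> Dict[str, float]:
    """Ranking value per datum: occurrences / total number of retrieved data."""
    total = len(items)
    return {datum: count / total for datum, count in Counter(items).items()}


def group_indices(
    ranking: Dict[str, float], group_of: Dict[str, int]
) -> Dict[int, float]:
    """Group statistical index: sum of the ranking values of the data in a
    group divided by the number of (distinct) data in that group."""
    members = defaultdict(list)
    for datum, value in ranking.items():
        members[group_of[datum]].append(value)
    return {group: sum(vals) / len(vals) for group, vals in members.items()}


def order_data(
    ranking: Dict[str, float], group_of: Dict[str, int], index: Dict[int, float]
) -> List[str]:
    """Order data by group index first, then by ranking value (both descending)."""
    return sorted(
        ranking, key=lambda d: (index[group_of[d]], ranking[d]), reverse=True
    )


# Hypothetical retrieved data and a hypothetical clustering result.
retrieved = ["museum", "museum", "ticket", "queue", "museum", "ticket", "art", "art"]
group_of = {"museum": 0, "art": 0, "ticket": 1, "queue": 1}

ranking = repetition_frequency(retrieved)
index = group_indices(ranking, group_of)
ordered = order_data(ranking, group_of, index)
print(ordered)
```

With these sample data, "museum" occurs 3 times out of 8 and therefore receives the ranking value 0.375, and the group containing "museum" and "art" receives the index (0.375 + 0.25) / 2 = 0.3125.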
Similar technologies:
Publication number | Publication date | Patent title
US20210165955A1|2021-06-03|Methods and systems for modeling complex taxonomies with natural language understanding
US20120136649A1|2012-05-31|Natural Language Interface
Al-Kabi et al.2016|A prototype for a standard arabic sentiment analysis corpus.
US9619555B2|2017-04-11|System and process for natural language processing and reporting
Hu et al.2011|Enhancing accessibility of microblogging messages using semantic knowledge
US10678820B2|2020-06-09|System and method for computerized semantic indexing and searching
Zainuddin et al.2016|Improving twitter aspect-based sentiment analysis using hybrid approach
Tayal et al.2016|Fast retrieval approach of sentimental analysis with implementation of bloom filter on Hadoop
Tiwari et al.2019|Ensemble approach for twitter sentiment analysis
Cao et al.2019|Extracting statistical mentions from textual claims to provide trusted content
Saleiro et al.2013|Popstar at replab 2013: Name ambiguity resolution on twitter
CN107688616B|2021-07-09|Make the unique facts of the entity appear
Sharma et al.2015|Tourview: Sentiment based analysis on tourist domain
CH712818B1|2020-12-30|System and method for the extrapolation and statistical processing of data obtained from one or more data sources.
Guo et al.2018|Query expansion based on semantic related network
Cha et al.2015|CBDIR: Fast and effective content based document Information Retrieval system
Ramdani et al.2018|Selecting user influence on twitter data using skyline query under mapreduce framework
Cherichi et al.2016|Big data analysis for event detection in microblogs
Chaabene et al.2016|Semantic Annotation for the “on demand graphical representation” of variable data in Web documents
Swezey et al.2012|Automatic detection of news articles of interest to regional communities
Smatana et al.2016|Topic modeling over text streams from social media
Dambhare et al.2017|Smart map for smart city
Cherichi et al.2016|Using big data values to enhance social event detection pattern
Saito et al.2018|Automatic labeling to classify news articles based on paragraph vector
Srivastava et al.2014|An Algorithm for Summarization of Paragraph Up to One Third with the Help of Cue Word Comparison
Patent family:
Publication number | Publication date
WO2016203446A1|2016-12-22|
Cited references:
Publication number | Filing date | Publication date | Applicant | Patent title

US6795820B2|2001-06-20|2004-09-21|Nextpage, Inc.|Metasearch technique that ranks documents obtained from multiple collections|
WO2015000083A1|2013-07-05|2015-01-08|Anysolution, Inc.|System and method for ranking online content|
Legal status:
Priority:
Application number | Filing date | Patent title
ITUB20151469|2015-06-17|
PCT/IB2016/053619|WO2016203446A1|2015-06-17|2016-06-17|System for extrapolation and statistical processing of data that can be acquired from one or more data sources|